Gender bias in large language models (LLMs) in adult social care

Sam Rickman
Supervisors: Jose-Luis Fernandez, Juliette Malley
Care Policy Evaluation Centre (CPEC) at LSE

December 2024

Large language models (LLMs) in adult social care

How widespread is this?

  1. June 2024 survey: 4 councils reported using LLMs in Adult Social Care (ASC).1
  2. 43% of councils see potential benefits of AI in ASC.2
  3. Sep 2024: 5 councils mentioned LLMs for social care in their Privacy Notices.
  4. Sep 2024: The Guardian reported 7 local authorities using LLMs in ASC, with 25 more piloting them.3
  5. Dec 2024: 9 councils mention social care LLMs in their Privacy Notices.

How are they used?

Traditional workflow: home visit → take notes → type summary → enter case note.

With LLMs:

  1. Record the home visit with a phone
  2. AI speech-to-text transcript
  3. LLM-generated summary
  4. Proof-read, then enter as case note

 

graph TD
    A[How well<br> do they<br> work?]:::largeFont
    A --> B[Accuracy]
    A --> C[Bias]
    C --> D[Gender Bias]
    C --> E[Ethnicity Bias]
    C --> F[Etc.]
    B --> G[Qualitative metrics]
    B --> H[Quantitative metrics]

    classDef whiteText fill:#0e1117,stroke:#ffffff,color:#ffffff,stroke-width:2px;
    class B,C,E,F,G,H whiteText;
    classDef largeFont font-size:48px,fill:#0e1117,stroke:#ffffff,color:#ffffff,stroke-width:0px;
    style D fill:#80ff00,stroke:#2c662d,stroke-width:2px,color:black;

Research question

Is there gender bias in state-of-the-art LLMs, when they are used in adult social care?

The data

  1. Data from a local authority.
  2. All adults who were:
    • Aged 65 years and over by 31 August 2020
    • Receiving care services in the community for at least one year since the end of 2015
  3. 3,046 individuals (62% women).

 

For each of the 3,046 individuals: needs assessments, services received and free-text case notes.

Information governance

  • Data pseudonymised before egress (names, locations, telephone numbers and NHS numbers removed)
  • Data Protection Impact Assessment (DPIA)
  • No automated decision-making
  • Details on project website and Privacy Notice
  • Individual opt-out available
  1. NHS Confidentiality Advisory Group (CAG) ✔️
    • Social care data.
  2. NHS Data Access Request Service (DARS) ⌛
    • Linked GP data
  3. LSE research ethics committee ✔️

Quantity of free text data

How do we assess bias?

flowchart TD
    A[How do we assess bias?]
    A --> B[Qualitative methods]
    A --> C[Quantitative methods]
    
    %% Subgraph for quantitative methods to group them neatly
    subgraph Quantitative_Methods[ ]
        C --> D[Counterfactual fairness]
        C --> E[Accuracy equality]
        C --> F[Statistical parity]
        C --> G[Conditional use accuracy]
        C --> H[Procedural accuracy]
        C --> I[Treatment equality]
    end

    classDef whiteText fill:#0e1117,stroke:#ffffff,color:#ffffff,stroke-width:2px;
    class A,B,C,E,F,G,H,I whiteText;
    classDef largeFont font-size:48px,fill:#0e1117,stroke:#ffffff,color:#ffffff,stroke-width:0px;
    style D fill:#80ff00,stroke:#2c662d,stroke-width:2px,color:black,font-weight:bold
    style Quantitative_Methods fill:none, stroke:none, title:none

Counterfactual fairness (Kusner et al., 2017)

A predictor \(\hat{Y}\) is counterfactually fair if, for any individual with observed protected attribute \(A = a\) and remaining attributes \(X = x\), and for any other possible value \(a'\) of \(A\):

\[\begin{align} \begin{split} P\left( \hat{Y}_{A \leftarrow a} = y \mid A = a, X = x \right) &= P\left( \hat{Y}_{A \leftarrow a'} = y \mid A = a, X = x \right), \\ &\quad \text{for all } y. \label{eq:counterfactual} \end{split} \end{align}\]

Where:

  • \(P(\hat{Y}_{A \leftarrow a} = y \mid A = a, X = x)\) is the probability that the prediction \(\hat{Y} = y\), given that the individual actually has attribute \(A = a\) and characteristics \(X = x\).

  • \(P(\hat{Y}_{A \leftarrow a'} = y \mid A = a, X = x)\) is the probability that the prediction \(\hat{Y} = y\), if, counterfactually, the protected attribute \(A\) were set to \(a'\), while keeping all else the same.
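The definition above can be illustrated with a toy, deterministic predictor (the predictor, attribute names and values here are hypothetical, not the study's model):

```python
# Toy check of the counterfactual fairness definition for a deterministic
# predictor: the prediction must not change when the protected attribute A
# is flipped from a to a' while the remaining attributes X stay fixed.

def predictor(gender, needs_score):
    """Hypothetical risk predictor that ignores the protected attribute."""
    return 0.1 * needs_score

def is_counterfactually_fair(predict, x, a, a_prime):
    """Deterministic special case of the definition: both distributions in
    the equation are point masses, so fairness reduces to equality."""
    return predict(a, **x) == predict(a_prime, **x)

print(is_counterfactually_fair(predictor, {"needs_score": 7}, "female", "male"))
# -> True: the prediction is identical under both values of the attribute
```

A predictor that added even a small gender-dependent term would fail this check, which is the intuition behind swapping gender in the case notes.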

Example: AI CV screener

  • Qualifications ✔️
  • Work experience ✔️
  • Skills ✔️
  • Gender ❌
  • Ethnicity ❌
  • Pregnancy ❌

How does this apply to adult social care?

Use LLM to change gender

Mrs Smith is an 87-year-old, white British woman with reduced mobility. She lives in a one-bedroom flat. She requires support with washing and dressing. She has three care calls a day.

Mr Smith is an 87-year-old, white British man with reduced mobility. He lives in a one-bedroom flat. He requires support with washing and dressing. He has three care calls a day.

Caveat: not all notes translate

  • Domestic violence
  • Prostate cancer
  • Mastectomy

Removed notes with sex-specific body parts or domestic abuse.
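A minimal rule-based sketch of the swap-and-screen step (the study used an LLM for the rewrite; the swap table and exclusion terms below are illustrative only):

```python
import re

# Rule-based sketch of the gender-swap step. A lookup table cannot resolve
# context-dependent words (e.g. "her" -> "his" or "him"), which is one
# reason to use an LLM instead; these lists are illustrative only.
SWAP = {"mrs": "mr", "woman": "man", "she": "he"}

# Notes containing these terms were dropped rather than swapped.
EXCLUDE = {"mastectomy", "prostate", "domestic"}

def swap_gender(note):
    """Return the gender-swapped note, or None if the note mentions
    sex-specific body parts or domestic abuse and should be removed."""
    words = re.findall(r"\w+", note.lower())
    if any(w in EXCLUDE for w in words):
        return None

    def repl(match):
        word = match.group(0)
        if word.lower() not in SWAP:
            return word
        swapped = SWAP[word.lower()]
        return swapped.capitalize() if word[0].isupper() else swapped

    return re.sub(r"\w+", repl, note)

print(swap_gender("Mrs Smith said she lives alone."))
# -> Mr Smith said he lives alone.
print(swap_gender("Recovering from a mastectomy."))
# -> None
```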

Summarisation models

  • Large language models:
    • Gemma (Google, 2024): 8bn parameters
    • Llama 3 (Meta, 2024): 8bn parameters
  • Benchmark models:
    • T5 (Google, 2019): 220m parameters
    • BART (Meta, 2019): 406m parameters

How do you compare free text summaries?

Strategy

Use LLMs to create summaries of case notes and measure:

  1. Sentiment analysis.
  2. Inclusion bias4: count of words related to themes:
    • physical health
    • mental health
    • physical appearance
    • subjective language
  3. Linguistic bias5: count of all words used for men and women.
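The inclusion-bias count (item 2) can be sketched as a lexicon tally; the tiny theme word lists below are illustrative stand-ins, not the study's lexicons:

```python
import re
from collections import Counter

# Lexicon tally for inclusion bias: count how often words from each theme
# appear in a summary. Real lexicons would be far larger.
THEMES = {
    "physical_health": {"mobility", "falls", "pain"},
    "mental_health": {"dementia", "anxiety", "depression"},
}

def theme_counts(summary):
    """Count occurrences of each theme's words in one summary."""
    words = Counter(re.findall(r"[a-z]+", summary.lower()))
    return {theme: sum(words[w] for w in lexicon)
            for theme, lexicon in THEMES.items()}

print(theme_counts("She has dementia and reduced mobility; anxiety noted."))
# -> {'physical_health': 1, 'mental_health': 2}
```

The per-gender theme counts produced this way are what the \(\chi^2\) test and Poisson regression then compare.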

Metrics

  1. Sentiment analysis
    • SiEBERT: a general-purpose, pre-trained, binary sentiment analysis model.
    • Regard: a pre-trained metric designed to evaluate gender bias in text.

\[\begin{align*} \text{sentiment}_{ij} &= \beta_0 + \beta_1 \cdot \text{model}_i + \beta_2 \cdot \text{gender}_j \\ &\quad + \beta_3 \cdot (\text{model}_i \times \text{gender}_j) + \beta_4 \cdot \text{max_tokens}_i \\ &\quad + u_{0j} + u_{1j} \cdot \text{model}_i + \epsilon_{ij} \label{eq:summarieslmm} \end{align*}\]

  2. Counts of words and themes
    • \(\chi^2\) test
    • Poisson regression

\[\begin{align*} \text{count}_{i} &= \beta_0 + \beta_1 \cdot \text{gender}_i + \beta_2 \cdot \text{max_tokens}_i \\ &\quad + \beta_3 \cdot \text{doc_id}_i + \epsilon_i \label{eq:worddtregression} \end{align*}\]
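The \(\chi^2\) test on word counts can be sketched with the standard library alone, as a Pearson test on a 2×2 table (word vs. all other words, women vs. men); the counts in the usage line are hypothetical:

```python
import math

def chi2_2x2(a, b, c, d):
    """Pearson chi-squared for a 2x2 contingency table [[a, b], [c, d]]
    (no continuity correction). Returns (statistic, p-value) with 1 df;
    the 1-df p-value uses the identity sf(x) = erfc(sqrt(x / 2))."""
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    expected = [row1 * col1 / n, row1 * col2 / n,
                row2 * col1 / n, row2 * col2 / n]
    stat = sum((o - e) ** 2 / e for o, e in zip([a, b, c, d], expected))
    return stat, math.erfc(math.sqrt(stat / 2))

# Hypothetical table: occurrences of one word vs all other words,
# in summaries of women's notes vs men's notes.
stat, p = chi2_2x2(1498, 250000, 1845, 240000)
print(f"chi2 = {stat:.1f}, p = {p:.3g}")
```

With many words tested, the resulting p-values would then be adjusted for multiple comparisons, as in the tables below.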

Results

Sentiment analysis: estimated marginal means (female - male)

| Model | Regard est. | t | p | SiEBERT est. | t | p |
|---|---|---|---|---|---|---|
| *Benchmark models* | | | | | | |
| bart | -0.0036 . | -2.0 | 0.051 | 0.0094 * | 2.2 | 0.031 |
| t5 | -0.0049 ** | -2.7 | 0.0072 | -0.01 * | -2.3 | 0.019 |
| *State-of-the-art models* | | | | | | |
| llama3 | -0.0021 | -1.2 | 0.25 | -0.0055 | -1.3 | 0.200 |
| gemma | 0.0069 *** | 3.8 | 0.00013 | 0.042 *** | 9.7 | 0.000 |

Frequency of themes

Word counts

Word counts: Gemma

| Word | N (women) | N (men) | p-value (adj.) |
|---|---|---|---|
| *Words used more for men* | | | |
| require | 1498 | 1845 | *** < 0.001 |
| receive | 554 | 734 | *** < 0.001 |
| resident | 298 | 421 | *** 0.001 |
| able | 689 | 848 | *** 0.005 |
| unable | 276 | 373 | *** 0.013 |
| complex | 105 | 167 | *** 0.017 |
| disabled | 1 | 18 | *** 0.008 |
| *Words used more for women* | | | |
| text | 5042 | 2726 | *** < 0.001 |
| describe | 3295 | 1764 | *** < 0.001 |
| highlight | 1084 | 588 | *** < 0.001 |
| mention | 314 | 136 | *** < 0.001 |
| despite | 753 | 478 | *** < 0.001 |
| situation | 819 | 538 | *** < 0.001 |

Examples

  • Linguistic bias
  • Inclusion bias

Linguistic bias

Linguistic bias: Gemma

Mr. Smith has dementia and is unable to meet his needs at home.

She has dementia and requires assistance with daily living activities.

Linguistic bias: Gemma

Mr Smith is a disabled individual who lives in a sheltered accommodation.

The text describes Mrs. Smith’s current living situation and her care needs.

Inclusion bias

Gemma: inclusion bias

Mr Smith was referred for reassessment after a serious fall and a fractured bone in his neck.

The text describes Mrs Smith’s current situation and her healthcare needs.

Gemma: inclusion bias

Mr. Smith is a 78 year old man with a complex medical history.

The text describes Mrs Smith a 78-year-old lady living alone in a town house.

Policy implications

  • Gemma: The man-flu effect?
  • Cases are prioritised on the basis of severity.
  • Care is allocated on the basis of need.

Llama 3

Recommendations: regulatory clarity

If the goal is fairness in LLMs, mandate evaluation of bias through regulation.

  1. Data Protection Act (2018) and UK General Data Protection Regulation (GDPR):
    • Permit predictive modelling (“profiling”) without consent where there is a legitimate public interest.
    • Restrict solely automated decision-making with legal or similarly significant effects.
  2. Medical Device Regulations 2002 ❌.
  3. UK AI Bill forthcoming.

Regulatory clarity: how?

  • Which domains? gender, ethnicity, socioeconomic status…
  • Who should bear costs of evaluation?
  • How do you evaluate bias?
    • Qualitative methods.
    • Quantitative methods: this is reproducible - code on GitHub.

Resources

Paper (pre-print)

GitHub
Figure 1

Footnotes

  1. Local Government Association, 2024. State of the sector: AI research report. Technical report. URL: https://web.archive.org/web/20240906174435/https://www.local.gov.uk/sites/default/files/documents/Local%20Government%20State%20of%20the%20Sector%20AI%20Research%20Report%202024%20-%20UPDATED_3.pdf. Accessed: 2024-09-06.

  2. Ibid.

  3. Booth, R. (2024), “Social workers in England begin using AI system to assist their work”, The Guardian, 30th September. Available at: https://www.theguardian.com/society/2024/sep/28/social-workers-england-ai-system-magic-notes? (Accessed 11th October 2024)

  4. Steen, J. and Markert, K., 2023. Investigating gender bias in news summarization. arXiv preprint arXiv:2309.08047.

  5. Caliskan, A., Bryson, J.J. and Narayanan, A., 2017. Semantics derived automatically from language corpora contain human-like biases. Science, 356(6334), pp.183-186.
